ABSTRACT

Using string-manipulation algorithms, FORTRAN
computer programs were designed for analysis of written material. The
programs measure length of a text and its complexity in terms of the
average length of words and sentences, map the occurrences of
keywords or phrases, calculate word frequency distribution and
certain indicators of style. Trials of the programs, in studies of
readability and reading rate, in aiding editors, in grading essays,
and in identifying sources of response bias in multiple-choice tests,
demonstrate the potential applications of these algorithms in
educational research and the usefulness of augmenting FORTRAN's
computational facilities with character-processing capability.
(Author)

The title of this talk is "A Scheme for Text
Analysis Using Fortran". If you have been drawn here by
the error in the program that announces "A Scheme for
TEST Analysis", let me encourage you to stay. You will
hear about scoring tests as well as about analyzing texts.

Our scheme for text analysis with Fortran is
based on the notion of English words as strings. We are
indebted to a colleague, C. P. Dnagna of Bell Laboratories,
for suggesting this idea. I will explain the string
concept in detail and then describe how we have used
strings in writing Fortran programs to measure readability,
develop editing aids and score tests.

The main point of my talk, however, is not to
describe our programs, which were written to meet specific
research needs. Rather, we believe that by using the
string concept, any educational researcher who has access
to Fortran can write his own programs tailored to his own
text-analysis needs.

Let me begin by pointing out what we are not
talking about. We are not describing a higher-level
language, such as Snobol, designed specifically for character
manipulation. We are not talking about a package of programs
such as the General Inquirer, designed to perform a
specific kind of text-analysis. We are talking about
adding an alphanumeric character-manipulation capability
to Fortran, a language designed to perform numeric
operations. We feel this adaptation is useful because
Fortran is a widely-known programming language available
at most computer installations.

The string concept is described in Fig. 1 of
the handout. A string is a sequence of adjacent
alphanumeric characters defined by three parts. The
string's character-sequence is a subset of a collection
of characters we call the character-collection. Notice
that these characters include letters, numbers, punctuation
marks and blanks. The position of the character-sequence
in the character-collection is defined by a string's pointer
and length. The pointer points to the character position in
the collection of the first character of the character-sequence.
The string's length equals the number of characters
in the character-sequence.

In Fig. 1, the character-sequence, -mouse-, is
defined by a pointer of 5 and length of 5. A pointer of 1
and length of 25 defines a character-sequence identical to
the character-collection in the example. The character-collection
remains the same regardless of changes in pointer
and length values. Any subset of adjacent characters can be
brought into focus by setting the pointer and length to
values that define the characters as a sequence. English
words or nonsense syllables can be character-sequences.
By varying the pointer and length values, a researcher may
examine the same piece of text in a variety of ways.
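
The pointer-and-length idea can be sketched in a few lines. The sketch below is an illustration only, written in Python rather than Fortran; the function name `sequence` is hypothetical, and the 25th character of the Fig. 1 collection, which is not legible in the handout, is assumed to be a blank.

```python
# Sketch of the string concept: a pointer and a length select a
# character-sequence from an unchanging character-collection.
# Positions are 1-based, as in Fig. 1. The 25th character is an
# assumption (a trailing blank); the handout does not show it clearly.
COLLECTION = "THE MOUSE CHASED THE CAT "


def sequence(collection, pointer, length):
    """Return the character-sequence defined by a 1-based pointer and a length."""
    start = pointer - 1
    return collection[start:start + length]


print(sequence(COLLECTION, 5, 5))    # MOUSE
print(sequence(COLLECTION, 11, 5))   # CHASE
print(sequence(COLLECTION, 13, 3))   # ASE
```

Only the two numbers change between calls; the collection itself is never copied or altered.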

Those of you who are familiar with Fortran may
wonder how it is possible to manipulate a single character
with standard Fortran. It isn't possible. Fortran's basic
storage unit is the computer word. Therefore, Fortran's
word-handling capabilities must be augmented with two
character-manipulating subroutines. One of these routines
stores a single character within a computer word; the other
routine fetches a single character from within a computer
word. These character-manipulation routines exist as system
subroutines at many computer installations. If such ready-made
subroutines are not available, they can be written by
anyone familiar with assembly language coding.
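
To make the two routines concrete, here is a minimal sketch of what "store" and "fetch" might do, written in Python for illustration. It is not the system subroutine of any particular installation. The names `pack`, `fetch_char` and `store_char` are hypothetical, and the word size (six characters of eight bits each) is an assumption chosen for convenience.

```python
# Sketch: several characters packed into one integer "computer word",
# with one routine to fetch a character and one to store a character.
CHARS_PER_WORD = 6   # assumed; Table 1 mentions six-character computer words
BITS_PER_CHAR = 8    # assumed code width, not taken from the talk
MASK = (1 << BITS_PER_CHAR) - 1


def pack(text):
    """Pack up to six characters, blank-padded, into one integer word."""
    word = 0
    for ch in text.ljust(CHARS_PER_WORD)[:CHARS_PER_WORD]:
        word = (word << BITS_PER_CHAR) | ord(ch)
    return word


def fetch_char(word, k):
    """Fetch the k-th character (1-based, counted from the left) of a word."""
    shift = BITS_PER_CHAR * (CHARS_PER_WORD - k)
    return chr((word >> shift) & MASK)


def store_char(word, k, ch):
    """Return a copy of the word with its k-th character replaced by ch."""
    shift = BITS_PER_CHAR * (CHARS_PER_WORD - k)
    return (word & ~(MASK << shift)) | (ord(ch) << shift)


def unpack(word):
    """Recover all six characters of a word as text."""
    return "".join(fetch_char(word, k) for k in range(1, CHARS_PER_WORD + 1))
```

For example, `unpack(store_char(pack("MOUSE"), 1, "H"))` gives the six characters `HOUSE` followed by a blank.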

Figures 2, 3 and 4 of the handout show how we have
used the string concept in writing programs. The string is
stored in a Fortran array, as shown in Fig. 2. The first
computer word of the array holds the string's pointer; the
second word holds the string's length and the remaining
computer words of the array hold the string's character-collection.
Different strings may be referred to by changing
the values of the first two computer words of the array.
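
The array layout just described might be sketched as follows. This is a Python illustration under stated assumptions, not the layout of Fig. 2 itself: a Python list stands in for the Fortran array, six-character text chunks stand in for packed computer words, and the names `make_string_array`, `char_at` and `current_sequence` are hypothetical.

```python
# Sketch of the array layout: element 0 holds the pointer, element 1 the
# length, and the remaining elements hold the character-collection in
# six-character "computer words" (six is an assumption).
CHARS_PER_WORD = 6


def make_string_array(text, pointer=1, length=None):
    """Build [pointer, length, word, word, ...] from a piece of text."""
    padded_len = -(-len(text) // CHARS_PER_WORD) * CHARS_PER_WORD
    padded = text.ljust(padded_len)
    words = [padded[i:i + CHARS_PER_WORD]
             for i in range(0, padded_len, CHARS_PER_WORD)]
    return [pointer, len(text) if length is None else length] + words


def char_at(array, position):
    """Fetch the character at a 1-based position of the collection."""
    word_index, offset = divmod(position - 1, CHARS_PER_WORD)
    return array[2 + word_index][offset]


def current_sequence(array):
    """Return the character-sequence selected by the first two elements."""
    pointer, length = array[0], array[1]
    return "".join(char_at(array, p) for p in range(pointer, pointer + length))


a = make_string_array("THE MOUSE CHASED THE CAT ", pointer=5, length=5)
print(current_sequence(a))   # MOUSE
a[0] = 11                    # change only the pointer
print(current_sequence(a))   # CHASE
```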

Figure 3 of the handout shows one basic operation of
our text-analysis programs: segmenting a text string into words
and sentences. A text is segmented by examining each character
in sequence to see if it is one of a special set of characters
we call breaks and enders. The segments of text delimited by
breaks and enders are called words and sentences. Example A
uses standard punctuation marks as breaks and enders. In
this case, the text is segmented into words and sentences as
we normally think of them. However, other sets of breaks and
enders may be used. Example C shows the segmentation of the
text into words and sentences that do not conform to our
normal usage. If you attempt this scan, you will appreciate
the computer's advantage over man in segmenting texts according
to special rules.

Each word or sentence segment has a pointer and
length that defines it as a character-sequence within the text
string. The values of the pointers and lengths can be saved
for later use. These values can act as entries to the words
and sentences stored within the character-collection of a
text string.
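
A scan of this kind might be sketched as follows. The sketch is a Python illustration and not the BRKRTN subroutine; the function name `segment` is hypothetical. It assumes that an ender also closes the current word, and that a segment with no closing break or ender is left undefined, as in Example C of Fig. 3.

```python
# Sketch of segmentation with user-chosen breaks and enders. Each word
# and sentence is recorded as a (pointer, length) pair with 1-based
# pointers, so the text itself is never copied.
def segment(text, breaks, enders):
    """Return (words, sentences) as lists of (pointer, length) pairs."""
    words, sentences = [], []
    word_start = sent_start = 1
    for pos, ch in enumerate(text, start=1):
        if ch in breaks or ch in enders:
            if pos > word_start:              # skip empty segments
                words.append((word_start, pos - word_start))
            word_start = pos + 1
        if ch in enders:
            if pos > sent_start:
                sentences.append((sent_start, pos - sent_start))
            sent_start = pos + 1
    return words, sentences


SAMPLE = "FIRST, REMEMBER THAT YOU MAY CHOOSE BREAKS AND ENDERS.DO YOU UNDERSTAND?"
words, sentences = segment(SAMPLE, breaks=", ", enders=".?")
print([SAMPLE[p - 1:p - 1 + n] for p, n in words])
print(len(sentences))
```

Calling `segment(SAMPLE, breaks="AEIOUY", enders="D")` instead applies the unusual rules of Example C to the same text.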

That a single character-collection can serve as the
source for many different character-sequences is an important
feature of the string concept. The use of this feature in
programming is illustrated in Fig. 4.

The purpose of the operation described in Fig. 4 is
to store one copy of each word type found in the input records
of a text. In the example, an input record has been stored
as a string in the array, ITXT. The array, IWORD, at the
top of the page holds one copy of each word type. Each time
a new word type is found its character-sequence is moved
from ITXT to the end of the IWORD character-collection. The
pointer to the beginning of the new word's character-sequence
in IWORD and its length are stored in IREF.

In the figure, the flow of the program has reached
the decision about the word -the- in ITXT. Is there a copy
of this word in IWORD? This decision is made by comparing
the string in ITXT with each string whose pointer and length
are in IREF. The character-collection of IWORD is the
character-collection for all the strings in IREF.

Two strings are identical if their character-sequences
match. The determination of identity normally requires a
character-by-character comparison of the strings. However,
certain programming strategies may shorten search time. For
example, there is no need to compare strings of different
lengths.
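
The length shortcut can be sketched as follows. This is a Python illustration, not the KOMST subroutine; the function name `same_sequence` and the sample collection are hypothetical.

```python
# Sketch of a string comparison that checks lengths first and only then
# compares character by character. Pointers are 1-based.
def same_sequence(coll_a, ptr_a, len_a, coll_b, ptr_b, len_b):
    """Return True if the two character-sequences are identical."""
    if len_a != len_b:                 # different lengths: no need to scan
        return False
    for offset in range(len_a):
        if coll_a[ptr_a - 1 + offset] != coll_b[ptr_b - 1 + offset]:
            return False               # stop at the first mismatch
    return True


ITXT = "THE DAY THE HOUR "
print(same_sequence(ITXT, 1, 3, ITXT, 9, 3))    # True:  THE vs THE
print(same_sequence(ITXT, 1, 3, ITXT, 5, 3))    # False: THE vs DAY
print(same_sequence(ITXT, 1, 3, ITXT, 13, 4))   # False: lengths differ
```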

The location in IREF of a word's pointer and length
may serve as the word's numeric identifier in programming.
For example, in the lower right-hand corner of Fig. 4, word
numbers have been stored in ISEN to represent the sequence of
words in ITXT. The pattern of numbers in ISEN might be used
to recover recurrent phrases in a text. Or the numeric
identifiers of the words could be used in making frequency
counts of the different word types in a text.

Since each word type's character-sequence can be
referenced through its pointer and length, a word's numeric
identifier also serves as an entry for retrieving the word's
characters. These can be used in printing the word.
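
The storage scheme of Fig. 4 might be sketched as follows. This is a Python illustration and not the WRDID subroutine: the names `iword`, `iref` and `isen` echo the arrays in the figure, but the function `store_word_types` is hypothetical, and the sketch stores word types end to end with no separating characters, which is an assumption.

```python
# Sketch of word-type storage: one copy of each word type is kept in a
# single character-collection (iword); iref holds each type's pointer
# and length; a type's location in iref is its numeric identifier.
def store_word_types(itxt, word_spans):
    """word_spans: (pointer, length) pairs of the words in itxt, 1-based."""
    iword = ""     # character-collection of word types
    iref = []      # (pointer, length) per word type
    isen = []      # numeric identifiers in order of occurrence
    for ptr, length in word_spans:
        target = itxt[ptr - 1:ptr - 1 + length]
        for ident, (wptr, wlen) in enumerate(iref, start=1):
            # compare only strings of equal length
            if wlen == length and iword[wptr - 1:wptr - 1 + wlen] == target:
                break
        else:
            # new word type: append it and record its pointer and length
            iref.append((len(iword) + 1, length))
            iword += target
            ident = len(iref)
        isen.append(ident)
    return iword, iref, isen


iword, iref, isen = store_word_types("THE DAY THE HOUR ",
                                     [(1, 3), (5, 3), (9, 3), (13, 4)])
print(isen)                          # [1, 2, 1, 3]
ptr, length = iref[3 - 1]            # retrieve word number 3
print(iword[ptr - 1:ptr - 1 + length])   # HOUR
```

A frequency count of word types then reduces to counting the numbers in `isen`.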

The programs that we have written to perform the
operations of string storage, assignment of numeric
identifiers and string retrieval are summarized in Table 1 of
the handout. These utility subroutines have been used to
write a number of general-purpose text-analysis programs
described in Table 2. Let me reemphasize that our text-analysis
programs are only examples. Different research needs
would and should produce different types of programs.

I would now like to illustrate some of the uses of
the EZIER programs described in Table 2. These illustrations
are intended only to suggest the range of analyses that can
be undertaken with programs using the string concept.

The LETVOW program describes a text in terms of its
overall length and its readability. The program estimates the
number of syllables in a passage from a count of the number of
vowels. The readability indices that the program calculates
are mean sentence length in words and mean word length in
syllables. While these measures can be made by hand, we have
found that LETVOW does them with greater ease and with much
greater accuracy when many long texts must be processed.
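
The two indices might be computed along the following lines. This Python sketch is not LETVOW. The talk does not say how the vowel count is turned into a syllable estimate, so the sketch simply takes one syllable per vowel; that rule and the treatment of Y as a vowel are both assumptions. The function name `readability` is hypothetical.

```python
# Sketch of two readability indices: mean sentence length in words and
# mean word length in (estimated) syllables.
VOWELS = set("AEIOUY")   # assumed vowel set


def readability(sentences):
    """sentences: a list of sentences, each a list of words."""
    n_words = sum(len(sentence) for sentence in sentences)
    n_vowels = sum(1
                   for sentence in sentences
                   for word in sentence
                   for ch in word.upper()
                   if ch in VOWELS)
    mean_sentence_length = n_words / len(sentences)
    mean_word_length = n_vowels / n_words   # crude: one syllable per vowel
    return mean_sentence_length, mean_word_length


text = [["THE", "MOUSE", "CHASED", "THE", "CAT"],
        ["DO", "YOU", "UNDERSTAND"]]
print(readability(text))   # (4.0, 1.875)
```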

Figure 5 of the handout illustrates one way that
LETVOW's analysis of individual sentences could be used to
produce more readable texts. The idea behind the plots shown
in Fig. 5 is that writers might communicate information more
effectively if they avoided the use of long words in long
sentences. The plots show the joint distribution of sentence
length in words and average word length in syllables for all
sentences in two different texts. The numbers by each point
are sentence reference numbers from our LOG program. This
program prints a text with each sentence numbered. The
lines dividing the plots into quadrants are used to define
long sentences and long words. The definition of quadrants
could be based on knowledge about readability. This kind of
display could be used by an editor to locate sentences in
the LOG printout that are too long or use lengthy words.
Sentences lying in the upper right-hand quadrants should be
especially good candidates for rewriting. Whether this
editing procedure would produce more comprehensible texts
has yet to be explored.
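
The quadrant rule could also be applied without a plot. The Python sketch below is an illustration only; the function name `flag_sentences`, the sample values and both cut-offs are hypothetical and are not taken from Fig. 5.

```python
# Sketch of the editing aid: list the sentences that fall in the
# "long sentence, long words" quadrant.
def flag_sentences(stats, max_words, max_syllables):
    """stats maps a sentence number to
    (length in words, mean word length in syllables)."""
    return sorted(number
                  for number, (n_words, syllables) in stats.items()
                  if n_words > max_words and syllables > max_syllables)


stats = {1: (12, 1.3), 2: (30, 1.9), 3: (28, 1.2), 4: (35, 2.1)}
print(flag_sentences(stats, max_words=25, max_syllables=1.6))   # [2, 4]
```

The returned numbers play the role of the sentence reference numbers in the LOG printout.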

The FINDR program has also been generally useful.
For example, we explored the possibility that certain words
are diagnostic of the knowledge state of a writer. Hiller
has suggested that words and phrases denoting vagueness would
be good barometers of knowledge. We looked at short essays
describing all that the writers could remember about material
they just read. We found that Hiller's vagueness terms
occurred rarely if at all in these short essays. Therefore,
we selected words we felt to be indicative of vagueness from
the 100 most frequently occurring words in American English.
We found that the words, it and they, were used significantly
more often in essays given a low score by a human scorer.

Since these essays had been scored with a checklist,
we also tried scoring the essays by selecting a set of key
words from the checklist. With our KEYS program we found
the frequency with which checklist keywords occurred in the
essays. As Hiller has just reported, keyword frequencies
were highly correlated with scores assigned by the human
scorer.
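
A keyword tally of this kind might be sketched as follows. This is a Python illustration, not the KEYS program; the function name `keyword_counts` and the sample sentence are hypothetical.

```python
# Sketch of a keyword search: report how often each user-specified
# word occurs in a list of text words.
from collections import Counter


def keyword_counts(words, keywords):
    """Return {KEYWORD: frequency} for every keyword, including zeros."""
    wanted = {k.upper() for k in keywords}
    counts = Counter(w.upper() for w in words if w.upper() in wanted)
    return {k.upper(): counts[k.upper()] for k in keywords}


essay = "IT WAS SAID THAT THEY KNEW IT".split()
print(keyword_counts(essay, ["it", "they", "we"]))
# {'IT': 2, 'THEY': 1, 'WE': 0}
```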

We have used our phrase-building program, PRS, to
look for sources of response bias in multiple-choice tests. We
looked at multiple-choice tests of comprehension associated
with reading selections from a speed-reading course. We
wanted to see if the words and phrases from the tests' correct
alternatives occurred more often in the reading selections than
the phrases from the error alternatives. If such were the case,
a reader might choose correct alternatives on a multiple-choice
test because these answers were more familiar and not because
the reader understood the content of the passage.

With the PRS program we searched through each reading
selection for the alternative answers in that selection's test.
We used entire alternatives and components of the alternatives
as phrases. This is illustrated in Fig. 6. The results of
this analysis are shown in Table 3. They suggest that students
saw proportionally more of the correct alternatives' phrases
in the post-lesson reading selection than in the pre-lesson
selection. This imbalance might predispose a student to
perform better on the post-lesson test than on the pre-lesson
test.
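
A phrase search of this kind might be sketched as follows. This is a Python illustration, not the PRS program; the function name `count_phrase` and the sample words are hypothetical. The same function works on lists of numeric identifiers in place of words.

```python
# Sketch of a phrase search: count how often a sequence of words
# occurs, in order and adjacent, within a list of text words.
def count_phrase(text_words, phrase_words):
    """Return the number of positions at which the phrase occurs."""
    n = len(phrase_words)
    if n == 0:
        return 0
    return sum(1
               for i in range(len(text_words) - n + 1)
               if text_words[i:i + n] == phrase_words)


selection = "THE CAT SAW THE CAT".split()
print(count_phrase(selection, ["THE", "CAT"]))   # 2
print(count_phrase(selection, ["CAT", "SAW"]))   # 1
print(count_phrase(selection, ["THE", "DOG"]))   # 0
```

Running the search once with phrases from the correct alternatives and once with phrases from the error alternatives gives the two tallies to be compared.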

Finally, we have found our utility subroutines
useful in writing programs to tabulate data. Figure 7
illustrates the tabulation of responses on a cloze test.
These responses were packed onto cards for input to the
utility subroutines, BRKRTN and WRDID, which separated
and classified the responses. The output shown in Fig. 7
was used to select correct responses. These correct
response types served as input to another program that
scored the responses of the individual subjects. Programming
of this type has made it possible to score large
numbers of cloze tests quickly and reasonably inexpensively.

Our program applications have demonstrated to us
that Fortran can handle alphanumeric materials efficiently
and effectively. We also found it particularly useful to
have Fortran's facility in numerical computation available
as texts were being analyzed.

Our programs represent only one approach to text-analysis.
These programs took the English word as their
basic unit of analysis. Other research might focus on
smaller or larger segments of text. The concept of a string
would be just as effective for these sorts of analyses. The
string concept's generality makes it potentially useful in
many areas of educational research.

A STRING DEFINES A CHARACTER-SEQUENCE AND HAS THREE
PARTS:

(1) A POINTER TO THE CHARACTER-POSITION OF THE
    FIRST CHARACTER OF A CHARACTER-SEQUENCE;

(2) THE LENGTH OF THE CHARACTER-SEQUENCE;

(3) A COLLECTION OF CHARACTERS OF WHICH THE
    CHARACTER-SEQUENCE IS A SUBSET.

POINTER   LENGTH
   5         5

CHARACTER COLLECTION

  T  H  E     M  O  U  S  E     C  H  A  S  E  D     T  H  E     C  A  T
  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
CHARACTER POSITIONS NUMBERED SEQUENTIALLY

IN THE EXAMPLE ABOVE, THE CHARACTER-SEQUENCE, MOUSE,
IS DEFINED BY THE STRING.

A NEW POINTER VALUE OF 11 WITH AN UNCHANGED LENGTH
OF 5 WOULD DEFINE THE CHARACTER-SEQUENCE -CHASE-.

A POINTER VALUE OF 13 AND LENGTH OF 3 WOULD DEFINE
THE CHARACTER-SEQUENCE -ASE-.

CHANGES IN THE VALUE OF THE POINTER OR LENGTH DO NOT AFFECT
THE CHARACTER-COLLECTION. IT REMAINS THE SAME.

FIG. 1

DIAGRAM OF A STRING

DEFINITIONS

A BREAK CHARACTER CAN BE ANY CHARACTER. THE SMALLEST USER-DEFINED UNIT,
A WORD, IS THE LONGEST SEQUENCE OF CHARACTERS OCCURRING BETWEEN TWO BREAK
CHARACTERS OR BETWEEN THE BEGINNING OF AN INPUT RECORD AND A BREAK CHARACTER.

AN ENDER CHARACTER CAN BE ANY CHARACTER. THE USER-DEFINED UNIT, A SENTENCE,
IS THE LONGEST SEQUENCE OF CHARACTERS OCCURRING BETWEEN TWO ENDER CHARACTERS
OR BETWEEN THE BEGINNING OF AN INPUT RECORD AND AN ENDER.

NOTE: # DENOTES A BLANK.

SAMPLE TEXT

FIRST, REMEMBER THAT YOU MAY CHOOSE BREAKS AND ENDERS.DO YOU UNDERSTAND?

EXAMPLE A

SEGMENTATION OF SAMPLE TEXT WITH THE TERMS, WORD AND SENTENCE,
CORRESPONDING TO THEIR NORMAL ENGLISH USAGE.

BREAKS   [illegible]
ENDERS   . ?

SENTENCES   WORDS
    1       FIRST
            REMEMBER
            THAT
            YOU
            MAY
            CHOOSE
            BREAKS
            AND
            ENDERS
    2       DO
            YOU
            UNDERSTAND

EXAMPLE B

SEGMENTATION OF SAMPLE TEXT USING ONLY
ONE BREAK CHARACTER AND ONE ENDER.

BREAKS   [illegible]
ENDERS   [illegible]

SENTENCES   WORDS
    1       FIRST,
            REMEMBER
            THAT
            YOU
            MAY
            CHOOSE
            BREAKS
            AND
            ENDERS.DO
            YOU
            UNDERSTAND

FIG. 3

THE USE OF BREAK CHARACTERS AND ENDER CHARACTERS TO SE

EXAMPLE C

SEGMENTATION OF SAMPLE TEXT USING
VOWELS AS BREAK CHARACTERS AND
D AS THE SINGLE ENDER.

BREAKS   A E I O U Y
ENDERS   D

SENTENCES   WORDS
    1       F
            RST,#R
            M
            MB
            R#TH
            T#
            #M
            #CH
            S
            #BR
            KS#
            N
    2       #
            N
    3       RS.
    4       #
            #
            N
    5       RST
            N

SENTENCE 6 IS UNDEFINED. THERE IS NO ENDER
CHARACTER DEFINING THE END OF 6.



ARRAY IWORD

ARRAY ITXT

  18   3   THE#DAY#THE#HOUR#
  (CHARACTER POSITIONS NUMBERED 1 TO 20)

ARRAY IREF

  EACH ENTRY HOLDS THE LENGTH OF A WORD AND THE POINTER TO THE
  START OF THE WORD IN THE IWORD STRING. THE LOCATION IN IREF
  (1 TO 10) SERVES AS THE NUMERIC IDENTIFIER OF THE WORD.

DEFINE AS A STRING THE TARGET WORD TO BE STORED.

  IN THIS EXAMPLE, -THE- IS THE WORD TO BE STORED. IT HAS BEEN
  DEFINED AS A STRING IN THE ARRAY ITXT.

DETERMINE WHETHER THE TARGET WORD HAS ALREADY BEEN STORED.

  IN THE EXAMPLE, THE WORDS FOUND IN ITXT ARE STORED IN IWORD.
  NOTE THAT A WORD THAT OCCURS MORE THAN ONCE IN ITXT IS STORED
  ONLY ONCE IN IWORD. EACH WORD TYPE IN IWORD IS DEFINED AS A
  STRING WHOSE POINTER AND LENGTH ARE STORED IN IREF. THE
  LOCATION IN IREF OF A WORD'S POINTER AND LENGTH IS ITS NUMERIC
  IDENTIFIER. THIS NUMBER CAN BE USED INSTEAD OF THE WORD IN
  FURTHER PROGRAMMING.

SAVE THE TARGET WORD.

  IN THIS EXAMPLE, THE TARGET WORD WOULD BE APPENDED TO THE
  CHARACTER-COLLECTION IN THE ARRAY IWORD. THE NEXT FREE LOCATION
  IN IREF WOULD BE USED TO STORE THE VALUES OF THE WORD'S POINTER
  AND LENGTH.

PICK UP THE NUMERIC IDENTIFIER OF THE TARGET WORD FOR FURTHER
PROGRAMMING USE.

  IN THIS EXAMPLE, THE NUMERIC IDENTIFIER OF -THE- IS 1. THE
  POINTER AND LENGTH OF -THE- ARE STORED IN IREF (1). A PROGRAM
  MIGHT USE 1 TO REPRESENT -THE- IN A LIST OF ITXT WORDS AS SHOWN
  IN THE ISEN LIST.

ARRAY ISEN

FIG. 4

WORD STORAGE USING STRINGS AND NUMERIC IDENTIFIERS

Table 1

A listing of EZIER utility subroutines
with brief descriptions of their
functions.

PROGRAM  PROGRAM
NUMBER   NAME        PROGRAM FUNCTIONS

  8      LININ       Reads in text records and converts them
                     to strings.

  9      BRKRTN      Searches a string for a specific set
                     of breaks and/or enders.

 10      NUMVOW      Counts the number of vowels in a string.

 11      NOHS        Counts the number of non-blank
                     characters in an array.

 12      KOMST       Compares two strings to determine if
                     their character-sequences are identical.

 13      WRDID       1) If a target string has not been
                     encountered previously, the string is
                     stored and assigned a unique
                     identification number.
                     2) If a target string has already been
                     encountered, the string's identification
                     number is retrieved.

 14      IDRTN       Retrieves the identification number
                     assigned to a string by WRDID.

 15      ICLAS       Classifies a string on the basis of its
                     length.

 16      OTSTR       Prepares a string for multiple-line
                     printing so that words will not be
                     continued from one printed line to
                     the next.

 17      PREH'/D     1) Calculates the number of six-character
         (PRERfN)    computer-words taken up by
                     the character-sequence of a string.
                     2) Appends blanks to the end of a
                     character-sequence if the end does not
                     completely fill a computer word.

Table 1 (continued)

PROGRAM  PROGRAM
NUMBER   NAME        PROGRAM FUNCTIONS

 21      FNDBK1      Searches a string for one of a set
                     of target characters stored in
                     another string.

 22      MOVST       Moves the character-sequence of one
                     string to the beginning of the
                     character-collection of another string.

 23      APDST       Appends the character-sequence of one
                     string to the end of the character-sequence
                     of another string.

 24      INSERT      Inserts a set of characters from an
         (STRING)    array into a string.

Table 2

A listing of the EZIER text-analysis programs
with brief descriptions of their functions.

PROGRAM  PROGRAM
NUMBER   NAME        PROGRAM FUNCTIONS

  1      LETVOW      1. Counts the number of letters, vowels
                        and syllables in the text and in
                        each sentence;
                     2. Calculates the mean sentence length in
                        words and the mean word length in
                        syllables for the text and the mean
                        word length for each sentence of the
                        text.

  2      FINDR       1. Calculates the frequency distribution
                        of all word types in a text;
                     2. Lists word types alphabetically;
                     3. Calculates type-token ratio.

  3      KEYS        1. Searches text for all occurrences of
                        words specified by the user;
                     2. Lists all the words found and their
                        frequency of occurrence.

  4      PRS         1. Searches text for all occurrences of
                        phrases specified by the user;
                     2. Lists all the phrases found and their
                        frequency of occurrence.

  5      LENSEN      1. Calculates the distribution of word
                        lengths in letters for each sentence
                        and the text;
                     2. Calculates frequency distribution of
                        sentence lengths in words.

  6      KXSDAT      1. Maps the occurrence of user-specified
                        keywords or groups of keywords on
                        the sentences of a text;
                     2. Calculates the mean sentence length
                        in words and the mean word length
                        in syllables of the sentences in
                        which each keyword or group of
                        keywords occurs.

  7      LOG         1. Prints text with lines and sentences
                        numbered in sequence;
                     2. Compiles a log indicating the line on
                        which each sentence begins.

RESPONSE CARD FOR RESPONSES TO A SINGLE
TEXT GIVEN BY A SINGLE SUBJECT

SAMPLE OF CLOZE-TEST SCORING

FREQUENCY DISTRIBUTION OF RESPONSES GIVEN FOR
EACH DELETED WORD (SLOT) IN A TEXT